Web service applications are distributed processes that are composed of dynamically bound services. 
In our previous work [15], we described a framework for performing runtime monitoring 
of web services against behavioural correctness properties (described using property patterns and converted 
into finite state automata). These specify forbidden behavior (safety properties) and desired 
behavior (bounded liveness properties). Finite execution traces of web services described in BPEL 
are checked for conformance at runtime. When violations are discovered, our framework automat- 
ically proposes and ranks recovery plans which users can then select for execution. Such plans for 
safety violations essentially involve "going back" - compensating the executed actions until an al- 
ternative behaviour of the application is possible. For bounded liveness violations, recovery plans 
include both "going back" and "re-planning" - guiding the application towards a desired behaviour. 
Our experience, reported in [16], identified a drawback in this approach: we compute too many 
plans due to (a) overapproximating the number of program points where an alternative behaviour is 
possible and (b) generating recovery plans for bounded liveness properties which can potentially vi- 
olate safety properties. In this paper, we describe improvements to our framework that remedy these 
problems and describe their effectiveness on a case study. 

1 Introduction 

A BPEL application is an orchestration of (possibly third-party) web services. These services, which can 
be written in a variety of languages, communicate through published interfaces. Third-party services can 
be dynamically discovered, and may be modified without notice. BPEL includes mechanisms for dealing 
with termination and for specifying compensation actions (these are defined on a "per action" basis, e.g., 
the compensation for booking a flight is to cancel the booking); yet, they are of limited use since it is hard 
to determine the state of the application after executing a set of compensations. 

In [15], we proposed a framework for runtime monitoring and recovery that uses user-specified 
behavioral properties to automatically compute recovery plans. This framework takes as input the target 
BPEL application, enriched with a compensation mechanism that allows us to undo some of the 
actions of the program, and a set of properties (specified as desired/forbidden behaviors). When a violation 
of a property is detected at runtime, the framework outputs a set of ranked recovery plans and enables 
applying the chosen plan to continue the execution. Plans for safety violations consist only of the 
"going back" part, which compensates executed actions until an alternative behavior of the application is 
possible. For bounded liveness violations, recovery plans include both the "going back" and the 
"re-execution" part, guiding the application 
towards a desired behavior (such plans are schematically shown using a dashed line in Figure 4). 

For example, consider the Travel Booking System (TBS) shown in Figure 2, which provides travel 
booking services over the web. In a typical scenario, a customer enters the expected travel dates, the 
destination city and the rental car location - airport or hotel. The system searches for available flights, 
hotel rooms and rental cars, placing holds on the resources that best satisfy the customer preferences. If 
the customer chooses to rent a car at the hotel, the system also books the shuttle between the airport and 
the hotel. If the customer likes the itinerary presented to him/her, the holds are turned into bookings; 
otherwise, the holds are released. Some correctness properties of TBS are P1: "there should not be 
a mismatch between flight and hotel dates" (expressing a safety property, or a forbidden behavior), 
P2: "a car reservation request will be fulfilled regardless of the location (i.e., airport or hotel) chosen" 
(expressing a bounded liveness property, or a desired behavior), and P3: "ground transportation must not 
be picked before a flight is reserved" (forbidden behavior). 

If the application exhibits a forbidden behavior, our framework suggests plans that use compensation 
actions to allow the application to "go back" to an earlier state at which an alternative path that potentially 
avoids the fault is available. We call such states "change states"; these include user choices and 
non-idempotent partner calls (i.e., those where a repeated execution with the same arguments may yield a 
different outcome) [15]. For example, if the TBS system produces an itinerary with inconsistent dates, a 
potential recovery plan might be to cancel the current hotel booking and make a new reservation that is 
consistent with the booked flight's dates. 

Another possibility is that the system fails to produce a desired behavior when calls to some partners 
terminate, leaving it in an unstable state. In such cases, our framework computes plans that redirect the 
application towards executing new activities, those that lead to completion of the desired behavior. For 
example, if the car reservation partner for the hotel location fails (and thus the "shuttle/car at hotel" 
combination is not available), the recovery plans would provide transportation to the user's destination 
(her "goal" state) either by trying to book another car at the hotel, or by undoing the shuttle reservation 
and trying to reserve the rental car from the airport instead. 

The effectiveness and scalability of a recovery framework like ours depend on (quickly) generating a small 
number of highly relevant plans. While our framework can generate recovery plans as discussed above, 
in our experience with TBS, reported in [16], we observed that it generates too many plans. At least two 
factors contribute to this problem: 

1. we over-approximate the set of change states and thus offer plans where compensation cannot 
produce an alternative path through the original system to avoid an error; and 

2. some recovery plans for desired behavior violations will (necessarily) lead to violations of forbid- 
den behaviors when executed, and thus should not be offered to the user. 

In this paper, we present two improvements that try to address these issues. The first improvement 
identifies the non-idempotent service calls that are relevant to the violation, i.e., their execution may 
affect the control flow of the current execution. The second improvement identifies computed plans 
that always lead to violations of forbidden behaviors, as the execution of these plans will cause another 
runtime violation and thus they should not be offered to the user. 

In what follows, we give a brief overview of the framework (Sec. 2) and our previous experience 
with the Travel Booking System (TBS) (Sec. 3). In Sec. 4, we use the TBS example to discuss the two 
plan generation improvements as well as their effectiveness. We conclude in Sec. 5 with a summary of 
the paper, related work and suggestions for future work. 

2 Overview of the Approach 

We have implemented our RUntime MOnitoring and Recovery framework (RuMoR) using a series of 
publicly available tools and several short (200-300 lines) new Python or Java scripts. The architecture of 
our tool is shown in Fig. 1a, where components and artifacts have been grouped by phase (Preprocessing, 
Monitoring or Recovery). In the Preprocessing phase, the correctness properties specifying desired and 
forbidden behaviors are turned into finite-state automata (monitors). We use the WS-Engineer extension 
for LTSA [6] to translate the BPEL application into a Labeled Transition System (LTS), enriched with 
compensation actions (model). We also compute change and goal states during this phase. 

Figure 1: a) Architecture of the tool; b) Recovery plan generation for violating a desired behavior. 

The Monitoring phase is implemented on top of the IBM WebSphere Process Server, a BPEL-compliant 
process engine for executing BPEL processes. Monitoring is done in a non-intrusive manner: 
the Event Interceptor component intercepts runtime events and sends them to the Monitor Manager, 
which updates the state of the monitors. The use of high-level properties allows us to detect the violation, 
and our event interception mechanism allows us to stop the application right before the violation occurs. 
RuMoR does not require any code instrumentation, does not significantly affect the performance of the 
monitored system (see [17]), and enables reasoning about partners expressed in different languages. 

During the Recovery phase, the Plan Generator component generates recovery plans using SAT-based 
planning techniques (see [15] for details). In the case of forbidden behavior violations, the Plan 
Generator determines which visited change states are reachable by executing available compensation 
actions. Multiple change states can be encountered along the way, thus leading to the computation of 
multiple plans. In the case of desired behavior violations, the Plan Generator tries to solve the following 
planning problem: "From the current state in the system, find all plans (up to length k) to achieve the 
goal, that go through a change state". The actions that a plan can execute are defined by the application 
itself; thus, the domain of the planning problem is the LTS model of the application. The initial and goal 
states of the planning problem are the current error state and the precomputed goal states, respectively. 
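
The planning problem stated above can be illustrated on a toy model. The following is a minimal sketch, not RuMoR's implementation: it swaps the SAT encoding for an explicit depth-first enumeration, and the LTS, its state names, its action names and the `find_plans` helper are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical LTS enriched with compensation actions:
# state -> list of (action, next_state). All names are illustrative.
TOY_LTS: Dict[str, List[Tuple[str, str]]] = {
    "s0": [("pickHotel", "s1"), ("pickAirport", "s3")],
    "s1": [("holdShuttle", "s2"), ("undo_pickHotel", "s0")],
    "s2": [("holdCarAtHotel", "goal"), ("releaseShuttle", "s1")],
    "s3": [("holdCarAtAirport", "goal")],
    "goal": [],
}


def find_plans(lts: Dict[str, List[Tuple[str, str]]], start: str,
               goals: Set[str], change_states: Set[str],
               k: int) -> List[List[str]]:
    """Enumerate action sequences of length <= k that lead from `start`
    to a goal state and visit at least one change state on the way."""
    plans: List[List[str]] = []

    def dfs(state: str, actions: List[str], seen_change: bool) -> None:
        seen_change = seen_change or state in change_states
        if state in goals:
            if seen_change:
                plans.append(list(actions))
            return  # do not extend a plan past the goal
        if len(actions) == k:
            return  # bound reached: stop unrolling the model
        for action, nxt in lts[state]:
            actions.append(action)
            dfs(nxt, actions, seen_change)
            actions.pop()

    dfs(start, [], False)
    return sorted(plans, key=len)  # shorter plans first
```

Calling `find_plans(TOY_LTS, "s2", {"goal"}, {"s0", "s2"}, 4)` starts from the state where the car hold at the hotel failed. The bound k plays the same role as the maximum plan length in the text: a larger k admits longer detours through compensation actions and therefore more plans.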



The process for computing recovery plans for desired behavior violations is shown in Fig. 1b. RuMoR 
uses Blackbox [11], a SAT-based planner, to convert the planning problem into a SAT instance. 
The maximum plan length is used to limit how much of the application model is unrolled in the SAT 
instance, effectively limiting the size of the plans that can be produced. Multiple plans are generated 
by modifying the initial SAT instance: new plans are obtained by ruling out those computed previously. 
Plans are extracted from the satisfying assignments produced by the SAT solver SAT4J and converted 
into BPEL for displaying and execution. SAT4J is an incremental SAT solver, i.e., it saves results from 
one search and uses them for the next. We take advantage of this for generating multiple plans. 
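
The idea of obtaining new plans by ruling out those computed previously can be sketched with blocking clauses. This is an illustrative sketch only: a brute-force search stands in for SAT4J, the DIMACS-style clause representation is an assumption, and `solve` and `enumerate_models` are hypothetical helper names.

```python
from itertools import product
from typing import List, Optional

Clause = List[int]  # DIMACS-style literals: 3 means x3, -3 means not x3


def solve(num_vars: int, clauses: List[Clause]) -> Optional[List[int]]:
    """Brute-force stand-in for a SAT solver: return one model or None."""
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return [v if bits[v - 1] else -v for v in range(1, num_vars + 1)]
    return None


def enumerate_models(num_vars: int, clauses: List[Clause],
                     limit: int) -> List[List[int]]:
    """Collect up to `limit` distinct models, ruling out each one found."""
    clauses = list(clauses)  # work on a copy of the instance
    models: List[List[int]] = []
    while len(models) < limit:
        model = solve(num_vars, clauses)
        if model is None:
            break  # no further models exist
        models.append(model)
        # Blocking clause: at least one literal must differ from this model.
        clauses.append([-lit for lit in model])
    return models
```

In a planning encoding, a blocking clause would more plausibly range only over the variables that encode the chosen actions, so that two assignments describing the same plan are not reported twice; the sketch blocks the full assignment for simplicity.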

All computed plans are presented to the application user through the Violation Reporter component. 
It generates a web page snippet with violation information as well as a web page for selecting a recovery 
plan. The application developer must include this snippet in the default error page, so that the computed 
recovery plans are displayed as part of the application when an error is detected. The Plan Executor 
executes the selected plan using dynamic workflows [18]. RuMoR takes advantage of their implementation 
as part of IBM WebSphere. 

Figure 2: BPEL implementation of the Travel Booking System. 

3 Monitoring The Travel Booking System 
3.1 BPEL Model 

Figure 2 shows the BPEL implementation of this system. TBS interacts with three partners (FlightSystem, 
HotelSystem and CarSystem), each offering the services to find an available resource (flight, hotel room, 
car and shuttle), place a hold on it, release a hold on it, book it and cancel it. Booking a resource is 
compensated by canceling it and placing a hold is compensated by a release. All other activities can be 
simply undone, i.e., they do not have explicit compensation actions. All external service calls are 
non-idempotent. In the rest of this paper, bf, bh and hc represent the service calls bookFlight, bookHotel and 
holdCar, respectively. 
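
The compensation scheme just described (booking is compensated by canceling, a hold by a release, everything else is simply undone) can be sketched as follows. This is a hedged illustration: the `cancel*` and `release*` action names and the `undo_plan` helper are assumptions made for the example, not identifiers taken from the TBS implementation.

```python
from typing import Dict, List

# Forward action -> compensating action. The book/cancel and hold/release
# pairing follows the text; the concrete compensation names are assumed.
COMPENSATION: Dict[str, str] = {
    "bookFlight": "cancelFlight",
    "bookHotel": "cancelHotel",
    "holdFlight": "releaseFlight",
    "holdHotel": "releaseHotel",
    "holdCar": "releaseCar",
}


def undo_plan(trace: List[str], target_index: int) -> List[str]:
    """Return the steps that roll `trace` back to the state reached after
    its first `target_index` actions, most recent action first."""
    steps: List[str] = []
    for action in reversed(trace[target_index:]):
        # Activities without an explicit compensation are simply undone.
        steps.append(COMPENSATION.get(action, "undo(" + action + ")"))
    return steps
```

For instance, rolling the trace `["holdHotel", "holdFlight", "pickAirport", "holdCar"]` back to the state after its first action yields the car release, then the undo of the pick, then the flight release.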

The workflow begins by <receive>'ing input (receiveInput), followed by a <flow> (i.e., parallel 
composition) with two branches, since the flight and hotel reservations can be made independently. The 
branches are labeled ® and ©: ®) find and place a hold on a flight, ©) place a hold on a hotel room 
(this branch has been simplified in this case study). If there are no flights available on the given dates, 
the system will prompt the user for new dates and then search again (up to three tries). After making 
the hotel and flight reservations, the system tries to arrange transportation (see the <pick> (i.e., making 
the external choice) activity labeled ©): the user <pick>'s a rental location (pickAirport or pickHotel, 
abbreviated as pa and ph, respectively) and the system tries to place holds on the required resources (car 
at airport, or car at hotel and a shuttle between the airport and hotel). 

Figure 3: Monitors: (a) A1, (b) A2 and (c) A3. Red states are shaded horizontally, green states are shaded 
vertically, and yellow states are shaded diagonally. 

Once ground transportation has been arranged, the reserved itinerary is displayed to the user 
(displayTravelSummary), and at this point, the user must <pick> to either book or cancel the itinerary. The 
book option has a <flow> activity that invokes the booking services in parallel, and then calls two local 
services: one that checks that the hotel and flight dates are consistent (checkDates), and another 
that generates an invoice (generateInvoice). The result of checkDates is then passed to local services 
to determine whether the dates are the same (sameDates) or not (notSameDates, abbreviated as nsd). 
The cancel option is just a <flow> activity that invokes the corresponding release services. Whichever 
option is picked by the user, the system finally invokes another local service to inform the user about the 
outcome of the travel request (informCustomer). 

3.2 Monitoring Behavioral Properties 

In general, the framework described in [15] allows the system developer to express desired and forbidden 
behavior as bounded liveness and safety properties, respectively. These are expressed using property 
patterns Q, converted into quantified regular expressions (QRE) [13] and then into monitoring automata. 
For example, the TBS properties described in Sec. 1 are expressed as follows: 

P1: Absence of a date mismatch event (notSameDates) After both a flight and hotel have been booked 
(bookFlight and bookHotel, in any order). 

P2: Globally hold a car (holdCar) in Response to a rental location selection (pickHotel or pickAirport). 

P3: Existence of flight reservation (holdFlight) Before the rental location selection (pickHotel or pickAirport). 

In our framework, monitors are finite state automata that accept bad computations. In order to 
facilitate recovery, we assign colors to the monitor states: 

Accepting states are colored red, signaling violation of the property. State 5 in Fig. 3a, state 3 in 
Fig. 3b and state 2 in Fig. 3c are red states. 

Yellow states are those from which a red state can be reached through a single transition. State 4 
in Fig. 3a, state 2 in Fig. 3b and state 1 in Fig. 3c are yellow states. 

Green states are states that can serve as good places to which a recovery plan can be directed. We 
define green states to be those states that are not red or yellow, but that can be reached through a 
single transition from a yellow state. State 1 in Fig. 3b and state 3 in Fig. 3c are green states. 

Table 1: Plan generation data. "-" marks cases which are not applicable, such as references to SAT for 
recovery from forbidden behavior violations. 



The details of the QRE translation and the formal definition of state colors can be found in [16]. 

The monitor A1 in Fig. 3a enters its error state (5) when the application determines that the hotel 
and flight booking dates do not match (the hotel and flight can be booked in any order). The monitor 
A2 in Fig. 3b represents property P2: if the application terminates (i.e., sends the TER event) before hc 
appears, the monitor moves to the (error) state 3. State 1 is a good state since the monitor enters it once 
a car has been placed on hold (hc). The monitor A3 in Fig. 3c represents property P3: it enters the good 
state 3 once a hold is placed on a flight (hf), and enters its error state 2 if the rental location (pa or ph) is 
picked before a flight is reserved (hf). 
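
The state coloring and the behavior of a monitor can be sketched in a few lines. The automaton below is a hypothetical reconstruction of A3 from the prose description alone (the figure is not reproduced here); the dictionary layout, the choice of state 1 as the initial state, and the convention that events without a listed transition leave the state unchanged are all assumptions.

```python
from typing import Dict, List

RED, YELLOW, GREEN = "red", "yellow", "green"

# Assumed shape of monitor A3: state 1 is initial, state 2 is the error
# state, state 3 is the good state reached once a flight is on hold.
A3 = {
    "initial": 1,
    "accepting": {2},
    "delta": {(1, "hf"): 3, (1, "pa"): 2, (1, "ph"): 2},
}


def color_states(monitor) -> Dict[int, str]:
    """Red = accepting; yellow = one transition away from a red state;
    green = neither, but one transition away from a yellow state."""
    edges = [(src, dst) for (src, _event), dst in monitor["delta"].items()]
    red = set(monitor["accepting"])
    yellow = {src for src, dst in edges if dst in red} - red
    green = {dst for src, dst in edges if src in yellow} - red - yellow
    colors = {s: RED for s in red}
    colors.update({s: YELLOW for s in yellow})
    colors.update({s: GREEN for s in green})
    return colors


def run_monitor(monitor, events: List[str]) -> int:
    """Feed events to the monitor and return the state it ends in.
    Events with no listed transition leave the state unchanged (assumed)."""
    state = monitor["initial"]
    for event in events:
        state = monitor["delta"].get((state, event), state)
        if state in monitor["accepting"]:
            break  # violation detected
    return state
```

Under these assumptions, `color_states(A3)` colors state 2 red, state 1 yellow and state 3 green, which agrees with the coloring the text reports for Fig. 3c.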



3.3 From BPEL to LTS 

In order to reason about BPEL applications, we need to represent them formally, so as to make precise 
the meaning of "taking a transition", "reading in an event", etc. Several formalisms for representing 
BPEL models have been suggested [7, 10, 14]. In this work, we use Foster's [5] approach of using a 
Labeled Transition System (LTS) as the underlying formalism. 

Definition 1 (Labeled Transition Systems) A Labeled Transition System LTS is a quadruple $(S, \Sigma, \delta, I)$, 
where $S$ is a set of states, $\Sigma$ is a set of labels, $\delta \subseteq S \times \Sigma \times S$ is a transition relation, and $I \subseteq S$ is a set of 
initial states. 

Effectively, LTSs are state machine models, where transitions are labeled whereas states are not. We 
often use the notation $s \xrightarrow{a} s'$ to stand for $(s, a, s') \in \delta$. An execution, or a trace, of an LTS is a sequence 
$T = s_0 a_0 s_1 a_1 s_2 \ldots a_{n-1} s_n$ such that $\forall i,\ 0 \le i < n$: $s_i \in S$, $a_i \in \Sigma$ and $s_i \xrightarrow{a_i} s_{i+1}$. 
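
Definition 1 and the notion of a trace translate directly into a small data structure. The sketch below is illustrative: the `LTS` class and the `is_trace` helper are hypothetical names, and the code assumes a well-formed system in which every transition connects states of S.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence, Tuple


@dataclass(frozen=True)
class LTS:
    states: FrozenSet[str]
    labels: FrozenSet[str]
    delta: FrozenSet[Tuple[str, str, str]]  # (source, label, target)
    initial: FrozenSet[str]


def is_trace(lts: LTS, seq: Sequence[str]) -> bool:
    """Check that seq = [s0, a0, s1, ..., a_{n-1}, s_n] alternates states
    and labels and that every step (s_i, a_i, s_{i+1}) is in delta."""
    if len(seq) % 2 == 0:
        return False  # a trace starts and ends with a state
    if seq[0] not in lts.states:
        return False
    for i in range(0, len(seq) - 2, 2):
        if (seq[i], seq[i + 1], seq[i + 2]) not in lts.delta:
            return False
    return True
```

As in the definition, the check does not require the first state to be initial; a caller that wants complete executions can additionally test `seq[0] in lts.initial`.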

The set of labels $\Sigma$ is derived from the possible application events: service invocations and returns, 
messages, scope entries, and conditional valuations. [5] specifies the mapping of all BPEL 1.0 activities 
into LTS. Conditional activities like <if> and <while> statements are represented as states with two 
outgoing transitions, one for each valuation of the condition. <pick> is also a conditional activity, but 
can have one or more outgoing transitions, one for each <onMessage> branch. <sequence> and <flow> 
activities result in the sequential and parallel composition of the enclosed activities, respectively. 

In [15], we describe how we augmented Foster's translation so that we can model termination, as 
well as BPEL compensation. According to our translation, the TBS LTS has 52 states and 67 transitions, 
and $|\Sigma| = 33$. Twenty of the BPEL activities (highlighted with a * symbol in Figure 2) yield a total of 35 
change states in the LTS. 
3.4 Experience: Recovery from a safety property violation 

We generated recovery plans for the following scenario (called trace t1, of length k = 21), which violates 
property P1: The application first makes a hotel reservation (holdHotel) and then prompts the user for new 
travel dates (updateTravelDates), since there were no flights available on the current travel dates. The 
car rental location is the airport (pickAirport). The system displays the itinerary (displayTravelSummary) 
but the user does not notice the date inconsistency and decides to book it. The TBS makes the bookings 
(bookFlight, bookHotel and bookCar) and then checks for date consistency (checkDates). Since the 
dates are not the same (notSameDates), we detect violation of P1 and initiate recovery. 

We generated plans starting with length k = 5 and going to k = 30 in increments of 5. In order to 
generate all possible plans for each k, we chose n - the maximum number of plans generated - to be 
MAX_INT. Table 1 summarizes the results. A total of 13 plans were generated, and the longest plan, 
which reaches the initial state, is of length 21 (and thus the rows corresponding to k = 25 and k = 30 
contain identical information). Since t1 violates a safety property, no SAT instances were generated, and 
the running time of the plan generation is negligible. 

The following plans turn t1 into a successful trace: p_A^1 - cancel the flight reservation and pick a new 
flight using the original travel dates, and p_B^1 - cancel the hotel reservation and pick a new hotel room 
for the new travel dates. Our tool generated both of these plans, but ranked them 11th and 12th (out 
of 13), respectively. They were assigned a low rank due to the interplay between the following two 
characteristics of our case study: (i) the actual error occurs at the beginning of the scenario (in the flight 
and hotel reservation <flow>), but the property violation was only detected near the end of the workflow 
(in the book flow), and (ii) t1 passes through a relatively large number of change states, and thus many 
recovery plans are possible. 

The first of these causes could potentially be fixed by writing "better" properties, i.e., ones that 
allow us to catch an error as soon as it occurs. We recognize, of course, that this can be difficult to do. 
The second stems from the fact that not all service calls marked as non-idempotent are relevant to P1 or 
its violation. In Sec. 4.1, we present a method for identifying the non-idempotent service calls that are 
relevant to the violation, i.e., their execution may affect the control flow of the current execution. By 
reducing the number of change states considered, fewer recovery plans will be generated. 



3.5 Experience: Recovery from a bounded liveness property violation 

The following scenario (we call it trace t2, with length 14) violates property P2. Consider an execution 
where the user reserves a hotel room (reserveHotel), and a flight (reserveFlight). He then chooses to rent 
a car at the hotel (pickHotel), but no cars are available at that hotel. TBS makes flight, hotel and shuttle 
reservations (holdFlight and holdHotel), but never makes a car reservation (holdCar). The user does not 
notice the missing reservation in the displayed itinerary (displayTravelSummary) and decides to book it. 
The TBS tries to complete the bookings, first booking the hotel (bookHotel) and then the car (bookCar). 
When the application attempts to invoke bookCar, the BPEL engine detects that the application tries 
to access a non-initialized process variable (since there is no car reservation), and issues a TER event. 
Rather than delivering this event to the application, we initiate recovery. 

We again use n = MAX_INT and vary k between 5 and 30, in increments of 5, summarizing 
the results in Table 1. The first thing to note is that our approach generated a relatively large number of 
plans (over 60) as k approached 30. While in general the further we move away from a goal link, the 
more alternative paths lead back to it, this was especially true for TBS which had a number of <flow> 
activities. The second thing to note is that our analysis remained tractable even as the length of the plan 
and the number of plans generated grew (around 1 min for the most expensive configuration). 
Executing one of the following plans would leave TBS in a desired state: p_A^2 - attempt the car rental 
at the hotel again, and p_B^2 - cancel the shuttle from the airport to the hotel and attempt to rent a car at the 
airport. Unlike t1, the error in this scenario was discovered soon after its occurrence, so plans p_A^2 and p_B^2 
are the first ones generated by our approach. p_A^2 actually corresponds to two plans, since the application 
logic for reserving a car at a hotel is in a <flow> activity, enabling two ways of reaching the same goal 
link. Plan p_B^2 was the 3rd plan generated. 

The rest of the plans we generated compensate various parts of t2, and then try to reach one of the 
three goal links. While these longer plans include more compensations and are ranked lower than p_A^2 and 
p_B^2, we still feel that it may be difficult for the user to sift through all of them. As in the case 
of safety property violations, we can reduce the number of plans generated by picking relevant change 
states. Furthermore, some of the computed recovery plans, when executed, lead to violations of safety 
properties, and thus should not be offered to the user. In Sec. 4.2, we present a method for identifying 
such recovery plans that always lead to violations of safety properties. 



4 Reducing the Number of Generated Plans 

As discussed above, our tool produces a set of recovery plans for each detected violation. However, in 
some cases this set includes unusable plans. In this section, we look at techniques for filtering out two 
types of unusable plans: those that require going through unnecessary change states, where re-executing 
the partner call cannot affect the (negative) outcome of the trace (see Sec. 4.1), and plans that fix a 
liveness property at the expense of violating some safety properties (see Sec. 4.2). 



4.1 Relevant Change States 

As discussed before, change states are application states from which flow-changing actions can be 
executed. These are user choices (<pick>), states modeling <flow> activities, and service calls whose 
outcomes are not completely determined by their input parameters (to which we refer as non-idempotent). 
For example, getAvailableFlights is a non-idempotent service call (and leads to the identification of 
various change states), since each new invocation of the service, with the same travel dates, may return 
different available flights. Non-idempotent service calls are identified by the developer. 

Let us reexamine the trace t1. This trace visited 13 change states, of which 11 correspond to 
non-idempotent service calls. The two flow activities executed on the trace identify two change states that 
coincide with two states already identified using non-idempotent service calls (holdHotel and bookCar). 
The remaining two change states correspond to the two <pick> activities on the trace (choice between 
rental locations, and choice between booking/canceling the itinerary). 

As <pick> and <flow> activities are flow-altering actions by definition, the change states identified 
by these activities are always relevant to the current violation. On the other hand, not all service calls 
marked as non-idempotent are relevant, i.e., their execution cannot modify the current execution trace. 
For example, bookFlight and bookHotel are both non-idempotent service calls that appear in t1, and so 
define two recovery plans. However, these two plans are not useful: after their execution, the application 
is forced to complete the execution of t1 in its entirety. This happens because none of the later control 
predicates depend on the output produced by these service calls. This example suggests a definition of 
relevant change state: 

Definition 2 (Relevant Change State) A change state is relevant if it is identified by: 1) a <flow> or 
<pick> activity, or 2) a non-idempotent service call, and a variable that appears in a control activity is 
data dependent on the outcome of this service call. 
Table 2: Predicates appearing on both traces, and the non-idempotent service calls that affect their values. 



In order to carry out the data dependency analysis on the application LTS, we must first determine 
which BPEL activities define and use process variables, and how to map this information to 
the LTS model. <invoke> and <assign> activities both define and use variables. For example, the 
getAvailableFlights service call takes as input the travelRequest variable (use) and modifies the 
availableFlights variable (definition). Both <while> and <if> activities use the variables that appear in the 
activity's predicate. <flow> and <pick> do not use or define variables. 

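We can now define the following two sets of variables for each LTS transition $s \xrightarrow{a} s'$: the set of 
variables defined by the action $a$, $\mathit{Def}(s \xrightarrow{a} s')$, and the set of variables used by the action $a$, $\mathit{Use}(s \xrightarrow{a} s')$. Formally: 

$$
\mathit{Def}(s \xrightarrow{a} s') =
\begin{cases}
\{\mathit{outVar}\} & \text{if $a$ represents <invoke ... outputVariable = "outVar" ...>} \\
\{\mathit{toVar}\} & \text{if $a$ represents <assign><to>toVar</to>...>} \\
\emptyset & \text{otherwise}
\end{cases}
$$

$$
\mathit{Use}(s \xrightarrow{a} s') =
\begin{cases}
\{\mathit{inVar}\} & \text{if $a$ represents <invoke ... inputVariable = "inVar" ...>} \\
\{\mathit{fromVar}\} & \text{if $a$ represents <assign><from>fromVar</from>...>} \\
\{v_1, v_2, \ldots, v_n\} & \text{if $a$ represents a <while> or <if> branch, and $v_1, v_2, \ldots, v_n$ appear in the corresponding <condition>} \\
\emptyset & \text{otherwise}
\end{cases}
$$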

The set of variable definitions that occur on a trace is the union of the definitions that occur on the 
individual transitions of the trace: for a trace $T = s_0 a_0 s_1 a_1 s_2 \ldots a_{n-1} s_n$, $\mathit{Def}(T) = \bigcup_{i=0}^{n-1} \mathit{Def}(s_i \xrightarrow{a_i} s_{i+1})$. 
Now we can define direct data dependency: a transition v is directly data dependent on another transition 
u if and only if v uses a variable defined by u, and there is a path from u to v where this variable is not 
redefined. Formally, 

Definition 3 (Directly Data Dependent) A transition $(q \xrightarrow{b} q')$ is directly data dependent on a 
transition $(p \xrightarrow{a} p')$ if and only if there is a trace $T = s_0 a_0 s_1 a_1 s_2 \ldots a_{n-1} s_n$ such that $p' = s_0$, $q = s_n$ and 
$(\mathit{Def}(p \xrightarrow{a} p') \cap \mathit{Use}(q \xrightarrow{b} q')) - \mathit{Def}(T) \neq \emptyset$. 

For example, the <if> activity labeled © and the holdFlight service call labeled © are both directly 
data dependent on the getAvailableFlights service calls at © and ®. 

We can now define data dependency: a transition v is data dependent on another transition u if there 
exists a path from u to v that can be divided into sections, where each section is directly data dependent on 
a previous section. For example, the bookFlight service call is directly data dependent on the invocation 
of holdFlight, so bookFlight is data dependent on both invocations of the getAvailableFlights service. 
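
The two notions of dependency can be made concrete along a single recorded trace. The sketch below is a simplification under stated assumptions: it analyzes one linear trace rather than all paths of the LTS, each step carries its Def and Use sets explicitly, and the helper names `directly_dependent` and `dependencies` are hypothetical.

```python
from typing import List, Set, Tuple

# One trace step: (action name, variables defined, variables used).
Step = Tuple[str, Set[str], Set[str]]


def directly_dependent(trace: List[Step], u: int, v: int) -> bool:
    """True if step v uses a variable defined by step u that is not
    redefined by any step strictly between u and v."""
    if u >= v:
        return False
    _, defs_u, _ = trace[u]
    _, _, uses_v = trace[v]
    redefined: Set[str] = set()
    for _, defs, _ in trace[u + 1:v]:
        redefined |= defs
    return bool((defs_u & uses_v) - redefined)


def dependencies(trace: List[Step], v: int) -> Set[int]:
    """Indices of all steps that step v is (transitively) data dependent on."""
    found: Set[int] = set()
    stack = [v]
    while stack:
        cur = stack.pop()
        for u in range(cur):
            if u not in found and directly_dependent(trace, u, cur):
                found.add(u)
                stack.append(u)  # follow the chain one section further back
    return found
```

With a three-step trace in which getAvailableFlights defines availableFlights, holdFlight uses it and defines a hold variable, and bookFlight uses that hold variable, `dependencies` reports that bookFlight depends on both earlier steps even though it is directly dependent only on holdFlight. The hold variable name in such an example is an assumption, since the text names only travelRequest and availableFlights.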

We now apply data dependency analysis to trace t1. This trace executed four control activities: 1) the 
<while> labeled ®, 2) the <if> labeled ©, 3) the <if> labeled ©, and 4) the <if> labeled ®. Table 2 
lists the corresponding predicates, as well as the non-idempotent service calls that can affect the values 
of these predicates. For example, the <while> condition is availableFlights <= 0 && tries < 3. This use 
of the availableFlights variable is directly data dependent on both appearances of the non-idempotent 
getAvailableFlights service. On the other hand, the tries variable is not data dependent on any 
non-idempotent service calls, since it is updated by an <assign> statement inside the <while> activity. 

Figure 4: A schematic view on plan generation and filtering. 

Table 3: Results of applying both improvements (separately, and then combined) to the TBS case study. 
"-" marks cases which are not applicable, since the second improvement only applies to bounded 
liveness properties. 

The data dependency analysis for predicates 2 and 3 is similar to that of predicate 1, and the results 
of the analysis are summarized in Table 2. In the case of predicate 4, the variable consistent is directly 
data dependent on the idempotent service checkDates, which is directly data dependent on the 
non-idempotent service calls holdHotel and holdFlight (since these services modify reservationData, the 
input parameter of the checkDates service). 

So, only five of the 10 non-idempotent service calls on trace t1 are identified as relevant. The <flow> 
and <pick> activities on trace t1 identify another three relevant change states, so RuMoR now generates 
a total of (k = 5), 2 (k = 10), 5 (k = 15) and 8 (k = 20, 25, 30) plans for this trace. The desired plans 
p_A^1 and p_B^1 are still generated (at k = 20, 25, 30), but are now ranked 6th and 7th (instead of 11th and 
12th). These two plans are still ranked low because of the amount of compensation they require, but by 
omitting plans that cannot alter the control flow of the current execution, we reduced the number of plans 
presented to the user by 50%. 

We carried out the same analysis on trace t2: six of the original 10 change states are marked as 
relevant. Since trace t2 visits the same <pick> and <flow> activities as t1, four of the relevant change 
states are those identified by these activities; the remaining two relevant change states correspond to the 
non-idempotent service calls associated with predicates 5 and 6, also summarized in Table 2. 



4.2 Avoiding Forbidden Behaviors 

Our second method aims to remove those plans that result in the system performing behavior which is 
explicitly forbidden. That is, we use safety properties to help filter recovery plans for liveness properties. 
This process is outlined in Figure 4: given a failing trace T, we compute a plan P which first "undoes"
the trace until a change state and then computes an alternative path to a certain goal (shown using dashed
lines). P is unsuitable if the path from the initial state going through this change state and continuing via
the computed alternative path towards the goal (shown using a thick line and denoted T_P) is forbidden.
That is, there exists a safety monitor A_i which enters an error state when executed on T_P.

The simplest method, presented here, applies the filtering w.r.t. safety properties after the set of
recovery plans has already been produced. That is, given a trace T and a plan P, we can compute T_P and
simulate every safety monitor on it, removing P from consideration if any monitor fails.
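
A minimal sketch of this post-hoc filtering follows. It assumes each monitor is a deterministic automaton given as a transition table with a single error state, and that a plan is a triple of a name, the retained prefix of the trace and the new path. The Monitor class, filter_plans and the self-loop on undefined transitions are illustrative assumptions, not RuMoR's API.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    """Deterministic safety monitor with a single error state (illustrative)."""
    initial: str
    error: str
    delta: dict  # (state, action) -> next state; missing entries self-loop

    def accepts(self, actions):
        """Return True if the monitor never reaches its error state."""
        state = self.initial
        for action in actions:
            state = self.delta.get((state, action), state)
            if state == self.error:
                return False
        return True


def filter_plans(plans, monitors):
    """Keep the names of plans whose combined trace passes every monitor."""
    kept = []
    for name, prefix, new_path in plans:
        combined = list(prefix) + list(new_path)  # prefix up to the change state, then the new path
        if all(m.accepts(combined) for m in monitors):
            kept.append(name)
    return kept
```

Each monitor here is re-run from its initial state over the whole combined trace, so the cost grows with the length of the prefix.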

The path from the initial state to the change state used in P can be very long, and thus we consider
simulating each monitor on the entire trace T_P to be very inefficient. We also cannot execute monitors
backwards from the error state of T along the "undo" part of P: while our monitors are deterministic, their
inverse transition relations do not have to be deterministic, making the execution in reverse problematic.
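
The following sketch shows why reverse execution is ambiguous. The two-transition table is a made-up example, and inverse is a hypothetical helper: the forward table is deterministic, but its inverse maps one (state, action) pair to two possible predecessors.

```python
from collections import defaultdict


def inverse(delta):
    """Map (target, action) to the set of states that can reach target on action."""
    inv = defaultdict(set)
    for (source, action), target in delta.items():
        inv[(target, action)].add(source)
    return dict(inv)


# Deterministic forward table: each (state, action) pair has exactly one successor.
DELTA = {("q0", "a"): "q2", ("q1", "a"): "q2"}

# Going backwards from q2 over action "a" yields two candidate predecessors.
PREDECESSORS = inverse(DELTA)
```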

Instead, we aim to maintain enough data during the execution of the trace T in order to be able to 
restart monitors directly from the change state, moving along the new, recomputed path of the plan. To 
do so, as T executes, we record the states of all monitors in the system in addition to the states and 
transitions of the application. Thus, for each state s of the application reached during the execution of 
trace T, we store a tuple (s, s_A1, ..., s_An), where s_Ai is the state of the monitor A_i when the application is in
state s. To check whether P is a valid plan, we go directly to the change state s_c in P, extract the tuple
(s_c, s_A1, s_A2, ..., s_An) stored as part of T, and then simulate each safety monitor A_i, starting from the state
s_Ai, along the part of P which starts at state s_c.
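
The bookkeeping described above can be sketched as follows. For simplicity, the position in the trace stands in for the application state s, and monitors are plain dictionaries with "initial", "error" and "delta" keys. These representations and the names record_trace and plan_is_safe are assumptions made for illustration, not RuMoR's data structures.

```python
def step(delta, state, action):
    """Deterministic transition; undefined (state, action) pairs self-loop."""
    return delta.get((state, action), state)


def record_trace(actions, monitors):
    """Run all monitors alongside the trace.

    snapshots[i] holds the tuple of monitor states after the first i actions.
    """
    states = [m["initial"] for m in monitors]
    snapshots = [tuple(states)]
    for action in actions:
        states = [step(m["delta"], s, action) for m, s in zip(monitors, states)]
        snapshots.append(tuple(states))
    return snapshots


def plan_is_safe(snapshots, change_index, new_path, monitors):
    """Resume each monitor from its recorded state at the change state
    and simulate only the new path of the plan."""
    for monitor, state in zip(monitors, snapshots[change_index]):
        for action in new_path:
            state = step(monitor["delta"], state, action)
            if state == monitor["error"]:
                return False
    return True
```

In this sketch, checking a plan costs time proportional to the length of its new path rather than to the length of the whole trace, in exchange for storing one tuple of monitor states per step of T.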

As an example, consider the TBS system and trace t2, described in Sec. [53} violating the property P2. 
Our approach produces over 60 plans to recover from this violation, for plan lengths k > 25 (see Table 1).
Consider the plan that goes back all the way until encountering the change state associated with the call
to getAvailableFlights, canceling the booked flights on the way. Afterwards, this plan attempts to re-book
a flight, but fails to do so. It continues executing, and tries to pick a car at the airport instead. However,
this plan violates property P3 (i.e., monitor A_3 would enter its error state upon seeing an action pa). Thus,
we automatically filter this plan out and do not present it to the user. 

Overall, applying this approach to recovery for trace t2 reduces the number of plans from over 60 to
41. Combining it with the computation of the relevant change states reduces the number of plans
further, to 23 (see Table 3). While this number is still relatively large, it is a considerable
improvement and enables the user to pick a desired plan more easily.



5 Summary and Related Results 

In this paper, we briefly summarized the RuMoR approach to runtime monitoring and recovery of web 
services w.r.t. behavioral properties expressed as desired or forbidden behaviors. We have also described 
two optimizations to the recovery plan generation: reducing the number of change states and using 
monitors to filter those plans which represent forbidden behaviors. 

Hallé and Villemaire [8, 9] suggest a monitoring framework where data-aware properties are
written in LTL enriched with first-order quantifications. Generating automata for runtime monitoring
w.r.t. such an expressive language is significantly more complex than in our framework. Recovery in the
work of Hallé and Villemaire is based on executing a predefined function, associated with an individual
property - i.e., all failures of the same property are treated in the same way, statically. In contrast, our
method is dynamic and generates recovery plans customized for each error.

An emerging research area in recent years is that of self-adaptive and self-managed systems (see
[1, 3, 12] for a partial list). A system is considered self-adaptive if it is capable of adjusting itself in
response to a changing environment. This approach is different from ours, since our framework does not
change the system itself, and recovery plans are discovered and executed using the original application.

The work of Carzaniga et al. [2] is the closest to ours in spirit. It exploits redundancy in web
applications to find workarounds when errors occur, assuming that the application is given as a finite-state
machine, with an identified error state as well as the "fallback" state to which the application should
return. The approach generates all possible recovery plans, without prioritizing them. In contrast, our
framework not only detects runtime errors but also calculates goal and change states and in addition
automatically filters out unusable recovery plans.

Our work in this space is ongoing. Specifically, we are interested in further case studies; optimized
usage of SAT solving for better plan generation (e.g., encoding forbidden behaviors as part of
the SAT problem rather than filtering them out after the plans have been generated); and ways to harvest
and effectively express behavioral properties, since these are key to the usability of our approach.